- A Comparison of the Network Time Protocol and Digital Time Service
-
- David L. Mills
- University of Delaware
- 12 February 1990
-
- Review by Joe Comuzzi, DEC
- Further Commentary by Dave Mills, UDel
- 18 March 1990
-
- Following is a review and commentary on the above document, which is
- available in the file pub/ntp/dts.txt on louie.udel.edu. This document
- is available in the file pub/ntp/dtsrev.txt on the same host. The
- original document is based on the DTS specification version T1.0.5 dated
- 18 December 1989, which I assume can be obtained from DEC.
-
- At my suggestion Joe Comuzzi of DEC thoroughly and incisively reviewed
- my document comparing NTP and DTS. He found some agreement, some
- disagreement and some errors on my part. I much appreciate the
- time and care this effort required. In the same spirit, I have reviewed
- his comments and responded with comments of my own. As time permits I
- intend to incorporate appropriate revisions into the body of the original
- document and submit it for wider distribution. Meanwhile, I offer the
- following discourse for further comment and evaluation. Personally, I
- have found the exchange useful, stimulating and suggestive of further
- refinements to NTP.
-
- The following discourse includes only those portions of the original
- document that are relevant to the reviewer's comments. These are indented
- three spaces. The reviewer's comments are flush with the left margin.
- These comments are included in their entirety and are unedited. My reply
- comments are preceded by a ">" symbol. References to the latest
- specification are to RFC-1119, with the exception of the mention of new
- appendices in the revised version of February 1990, which can be found in
- the PostScript file pub/ntp/ntp.ps on louie.udel.edu.
-
- -------------------------------------------------------------------------
-
- The Digital Time Service (DTS) for the Digital Network Architecture
- (DECnet) is intended to synchronize time in computer networks ranging in
- size from local to wide-area.
-
- You seem to be trying to clothe DTS in a proprietary cloth. We now refer to
- DECnet as DECnet/OSI since we've incorporated OSI protocols into the
- protocol stack. It is our intention to pursue DTS in the OSI standards
- forums.
-
- > I have no intent to clothe DTS in anything other than explicitly stated
- > on the cover and introduction to the spec document. There it says "DNA
- > Phase 5 network." I will be glad to preach any other gospel or creed
- > practiced by DEC's men of cloth if you will change the cover and
- > introduction to the spec.
-
- As such it is intended to provide service
- comparable to the Network Time Protocol (NTP) for the Internet
- architecture.
-
- While both are clearly addressing the same problem space, DTS and NTP
- have VERY different goals. I recently spoke to the president of a time
- provider manufacturer and I liked his jargon: he distinguished between the
- time-of-day market and the frequency market. The time-of-day market wants
- to know what time it is, it is not interested in small errors and it
- doesn't want to pay a lot. The frequency market wants stable frequency
- sources, needs high stability and is willing to pay.
-
- > I didn't know the time providers distinguished between the time-of-day
- > market and frequency market. Certainly their customers don't know the
- > difference. No timecode receivers known to me have the requisite
- > stability to be considered primary frequency providers in any case;
- > that's what rubidium and cesium standards are for. I do not understand
- > the basis for your conclusion that accurate frequency costs more than
- > accurate time. While the algorithms are somewhat more complicated and the
- > host-clock implementation must be more rigidly specified, this does not
- > necessarily cost more, especially if there is almost a decade of research
- > in refining the methodology.
-
- NTP is a solution for the frequency market. DTS is only interested in
- the time-of-day market. The major cost for these solutions is not
- the initial capital investment, but the long term management and operation
- cost. As such DTS has goals of auto-configurability and ease of management
- which are not present in NTP.
-
- > If you are convinced that accurate, reliable time-of-day service can be
- > achieved without consideration for frequency and believe that errors as
- > much as several seconds per day in the absence of connectivity are
- > acceptable, then I won't argue with DTS being a reasonable approach. I
- > accept that NTP has goals primarily of stability, accuracy and
- > reliability and secondarily of configurability and ease of management,
- > since other Internet protocols would be expected to provide those
- > functions (see below).
-
- > (portion deleted)
-
- The goal of a distributed timekeeping
- service such as NTP and DTS is to synchronize the clocks in all
- participating servers and clients so that all are correct, indicate the
- same time relative to UTC, and maintain specified measures of stability,
- accuracy and reliability.
-
- As stated above, DTS is addressing the time-of-day market hence high
- frequency stability is not a goal of DTS.
-
- > Do you mean that "specified measures of stability, accuracy and
- > reliability" do not apply to DTS? Should I specifically point out that
- > stability is a non-goal of DTS? A stability bound is in fact an
- > architectural constant "maxDrift" in DTS, which sounds like a
- > "specified measure" to me.
-
- > (portion deleted)
-
- Servers, both primary and secondary, typically run NTP with several
- other servers at the same or lower stratum levels; however, a selection
- algorithm attempts to select the most accurate and reliable server or
- set of servers from which to actually synchronize the local clock. The
- selection algorithm, described in more detail later in this document,
- uses a maximum-likelihood clustering algorithm to determine the best
- from among a number of possible servers. The synchronization subnet
- itself is automatically constructed from among the available paths using
- the distributed Bellman-Ford routing algorithm [BER87], in which the
- distance metric is modified hop count.
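-
- A minimal sketch of how hop-count strata could fall out of a
- Bellman-Ford style relaxation. The topology and host names are
- hypothetical, and plain hop count stands in for the modified metric
- mentioned above; this is not the NTP selection algorithm itself.
-
```python
# Hypothetical subnet: each host lists the peers it may synchronize from.
PEERS = {
    "radio-a": [],                  # primary server with a reference clock
    "radio-b": [],
    "gw-1": ["radio-a", "gw-2"],
    "gw-2": ["radio-b", "gw-1"],    # gw-1 and gw-2 peer with each other
    "host-x": ["gw-1", "gw-2"],
    "host-y": ["host-x"],
}
PRIMARIES = {"radio-a", "radio-b"}
UNSYNCHRONIZED = 16  # assumed stand-in for "infinity" in this sketch


def assign_strata(peers, primaries):
    """Relax hop counts until no host can shorten its distance to a primary."""
    stratum = {h: (1 if h in primaries else UNSYNCHRONIZED) for h in peers}
    changed = True
    while changed:
        changed = False
        for host, sources in peers.items():
            for src in sources:
                candidate = stratum[src] + 1
                if candidate < stratum[host]:
                    stratum[host] = candidate
                    changed = True
    return stratum


strata = assign_strata(PEERS, PRIMARIES)
for host in sorted(strata, key=strata.get):
    print(f"{host:8s} stratum {strata[host]}")
```
-
- Because a host adopts only a strictly smaller distance, the mutual
- peering between gw-1 and gw-2 in this sketch cannot close a loop.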
-
- Note that in DTS loops are not a problem, if a system sends out a time
- and ultimately gets back a derived time, due to the communication delays
- the derived time will always arrive back with a larger inaccuracy.
- The only exception to this is the possibility of a system with a time
- provider and a lousy clock. Then the derived time's inaccuracy could be
- smaller if the time was parked in a system with a good clock. But in
- this case the network clearly has information that the original system
- has lost.
-
- > It would seem that the strategy to avoid subnet loops is similar in both
- > NTP and DTS, although in NTP the metric is stratum (hop count) and in
- > DTS it is the inaccuracy interval (is there a better word than
- > "inaccuracy" with a more positive connotation?) Both NTP and DTS
- > appear to operate in similar ways to cast out noisy timecode receivers,
- > although it is not clear to me how the DTS manager determines from the
- > protocol and the radio what the inaccuracy interval should be. Both NTP
- > and DTS model the receiver similar to an ordinary peer, presumably with
- > smallest inaccuracy interval or lowest stratum. In principle, both could
- > estimate these and related information directly from the timecode
- > samples.
-
- > (portion deleted)
-
- The NTP specification includes no architected procedures for servers to
- obtain addresses of other servers other than by configuration files and
- public bulletin boards.
-
- This is a serious short-coming of NTP and definitely makes it harder to
- manage. It is unclear to me why you haven't fixed this since it would
- not seem that difficult to store server names in a namespace.
-
- > There are three issues here: (1) how to discover a set of time-servers
- > which are potentially useful peers, (2) how to intelligently select
- > an appropriate subset, based on performance expectations and (3) how
- > to translate names to addresses. Internet protocols are notoriously
- > weak on (1) and (2); however, (3) is a non-issue with NTP, since all
- > NTP daemons use the Internet DNS to resolve addresses from names and in
- > principle could use the DNS to discover servers (WKS records). For (1)
- > now, there is a master file on an obscure host, which is updated
- > haphazardly at irregular intervals using completely unauthenticated
- > data obtained from unreliable sources. Issue (2) is Real Hard when
- > the number of potential peers runs in the thousands and considerations
- > of network overhead, access policy and export control (drat DES) are
- > involved.
-
- > DTS uses LAN discovery protocols and automatic global server registration
- > in a global database, which vastly simplifies (1) and (3); however, I
- > submit that, as DTS gets bigger, (2) will become as hard in DTS as it
- > has in NTP. For instance, survey evidence suggests there are over 2000
- > hosts supporting NTP and potentially available as servers registered
- > in the DNS. Using the DTS model that flushes the server list every 12
- > hours and expects that every server and clerk maintains the entire
- > list, one might expect a good deal of network clanking, unless the list
- > were pruned and stratified as a cooperative management exercise.
-
- While servers passively respond to requests from
- other servers, they must be configured in order to actively probe other
- servers. Servers configured as active poll other servers continuously,
- while servers configured as passive poll only when polled by another
- server. There are no provisions in the present protocol to dynamically
- activate some servers should other servers fail.
-
- This is harder to fix and interacts with the spanning tree. Here at least
- I can see why you didn't make it easier to manage.
-
- These problems make NTP a system administrator's nightmare, but are
- consistent with the two different sets of goals. Consistent with DTS goals
- we've accepted some "clock hopping" in exchange for ease of management.
-
- > I'm not sure what you mean by "nightmare." Most NTP administrators
- > snarf a copy of one of the two Unix daemons, compile it locally,
- > make an uneducated guess which existing server(s) in the master list to
- > use based on advice included in the distribution, build a simple
- > configuration file, turn the keys and walk away. In fact, DEC is
- > presently distributing NTP with Ultrix and includes a five-page writeup
- > on how to do this; which, although not an engineered solution, would
- > not ordinarily be considered a nightmare.
-
- > While you and I might consider NTP configuration crude, it is really no
- > better or worse than bringing up a j-random router or DNS server. In DTS
- > clients and servers wake up once in a while and solicit time in
- > connectionless mode on LANs and connection mode on WANs, while in NTP
- > peers solicit time continuously at controlled rates in connectionless
- > mode. On the issue of "dynamically activate," it appears that DTS does
- > just that with backup couriers in order to minimize WAN overhead.
- > This is a good thing and should be done in NTP. Dynamic activation is
- > on my list, but not above integration with the IP multicast service.
-
- In response to stated needs for security features, NTP includes an
- optional cryptographic authentication mechanism. NTP also includes an
- optional comprehensive remote monitoring mechanism found necessary for
- the detection and repair of various problems in server and network
- configuration and operation. It is anticipated that, when generic
- features capable of these functions have been developed and deployed in
- the Internet, the NTP authentication and monitoring mechanisms may be
- withdrawn.
-
- > This might be called poor-boy network management; expedient and ugly,
- > but necessary. An SNMP interface is in progress for one of the Unix
- > daemons. Same goes for the authentication mechanism, which is a
- > necessary feature used to partition the subnet for repair when
- > a server comes unglued.
-
- > (portion deleted)
-
- In DTS a synchronization subnet consists of a structured graph with
- nodes consisting of clerks, servers, couriers and time providers. With
- respect to the NTP nomenclature, a time provider is a primary server, a
- courier is a secondary server intended to import time from one or more
- distant primary servers for local redistribution and a server is
- intended to provide time for possibly many end nodes or clerks. Time
- providers, servers and couriers are evidently generic, in that all
- perform similar functions and have similar or identical internal
- structure.
-
- Not only are they generic, they are dynamic. If a time provider system
- loses its radio signal, it immediately reverts to a server, providing
- graceful degradation in the presence of failures.
-
- > Your enthusiasm is contagious. NTP does exactly the same thing.
-
- The DTS story is actually even better here, we provide a well defined
- time provider interface. This can be used to implement a time provider
- without requiring modification of the protocol portions of the time
- service. (On Unix systems it uses Unix domain sockets). This greatly
- eases adding a new time provider, and permits time provider vendors to
- supply it with their hardware. Note, NTP could (and probably should) do
- this also. We have already done it.
-
- > The NTP spec includes a procedure for time provider interface, although
- > the entity interactions are only informally specified. However, the NTP
- > interface is substantially the same as the peer interface, while in DTS
- > the interface is different. Perhaps the most interesting difference is
- > that the DTS provider interface expects a series of time values and
- > uses the DTS procedures to refine the estimate, which is similar in
- > intent to the NTP clock filter, but the NTP clock filter applies to
- > all peers in addition to the provider.
-
- > As specified in the introduction, the NTP spec is not intended as a
- > formal one (in the best and worst Internet traditions). However, we
- > have a little project at UDel to rewrite it in Estelle and throw test
- > cases at it. The project has already found a small number of minor
- > sleazes and obscurisms. You are to be congratulated in your formal
- > approach using Modula2+. Have you subjected the protocol description
- > to formal verification and testing? Can you make your Unix daemon
- > available for testing? Would you agree to publish the spec document
- > as an RFC?
-
- As in NTP, DTS clients and servers periodically request the time from
- other servers, although the subnet has only a limited ability to
- reconfigure in the event of failure.
-
- I don't understand this statement. Reconfiguration within a LAN is
- about as complete as one could imagine. The random selection of global
- servers is robust against any non-partitioning WAN failures.
-
- > My statement was misleading and should be clarified. Assuming the
- > global directory service is robust, DTS certainly is robust against
- > non-partitioning WAN failures; however, there are only three levels
- > in the DTS subnet (global server, courier/server, client). In NTP there
- > can be several levels or strata (commonly up to five or more). My comment
- > was meant in the context of reforming the NTP subnet as a spanning
- > tree rooted at the primary servers when something croaks. This of course
- > requires engineered peer paths and prior knowledge of WAN connectivity,
- > which is certainly not among the goals of DTS.
-
- > (portion deleted)
-
- On local nets DTS servers multicast to each other in order to construct
- lists of servers available on the local wire. Clerks multicast requests
- for these lists, which are returned in monocast mode similar to ARP in
- the Internet. Couriers consult the network directory system to find
- global time providers. For local-net operation more than one server can
- be configured to operate as a courier, but only one will actually
- operate as a courier at a time.
-
- This is false, I think you're failing to distinguish between couriers and
- backup couriers. There can be more than one courier per LAN, each will
- always synchronize with at least one member of the global set. Backup
- couriers use an election algorithm in the absence of a courier. Only one
- backup courier will be elected to function as a courier.
-
- > Correction noted. Do you always expect to have multiple couriers (other
- > than the single elected backup) in order to ensure diversity and
- > redundancy anyway? The local servers check each other for consistency
- > and those set as couriers read at least one, but not necessarily more
- > than one, global clock.
-
- There does not appear to be a multicast
- function in which a personal workstation could obtain time simply by
- listening on the local wire without first obtaining a list of local
- servers.
-
- That is correct, it would violate the principle that a message exchange
- has to happen in order to correctly assign an inaccuracy.
-
- > There appears to be a considerable Internet constituency which has
- > noisily articulated the need for a multicast function when the number of
- > clients on the wire climbs to the hundreds. Having responded to the
- > articulation noise, I thought it might be a reasonable idea to include
- > this capability (so far untested) on LANs with casual workstations,
- > promiscuous servers and simple protocol stacks.
-
- > (portion deleted)
-
- Perhaps the widest departure between the NTP and DTS philosophies is the
- basic underlying statistical model. NTP is based on maximum-likelihood
- principles and statistical mechanics, where errors are expressed in
- terms of expectations. DTS is based on provable assertions about the
- correctness of a set of mutually suspicious clocks, where errors are
- expressed as a set of computable bounds on maximum time and frequency
- offsets. This section explores these models and how they affect the
- quality of service.
-
- > You chose not to respond to the statistical models presented. Does that
- > mean you are in substantial agreement with the exposition?
-
- > (portion deleted)
-
- Both NTP and DTS exist to provide timestamps to some specified accuracy
- and precision. NTP represents time as a 64-bit quantity in seconds and
- fractions, with 32 bits as integral seconds and 32 bits as the fraction
- of a second. This provides resolution to about 200 picoseconds and
- rollover ambiguity of about 136 years. The origin of the timescale and
- its correspondence with UTC, atomic time and Julian days is documented
- in [MIL90c]. DTS represents time to a precision of 100 nanoseconds,
- although there appears to be no specified maximum value.
-
- The DTS time is a signed 64 bits of 100 nanoseconds since Oct 15, 1582.
- It will not run out until after the year 30,000 AD. Unlike NTP which
- will run out in 2036. I, for one, intend to still be alive in 2036!
-
- There are two reasons the 100 ns. was chosen:
- 1) We want to use these timestamps as a time representation, for
- filesystem timestamps, etc. We REALLY don't want to deal with the
- problem that our representation is inadequate in some reasonably
- future time. Also, since the 64 bits is signed, times back to
- 28,000 BC can be represented. This is potentially useful for
- astronomical data, and happily, includes all of recorded history.
-
- If we decreased the resolution, we would give up range. This choice
- seemed like a reasonable compromise.
- 2) Since we include the transmission delay in the inaccuracy, 100 ns
- represents only 30 meters. It's not meaningful to talk about
- synchronizing clocks below that level with our algorithm. (I believe
- it's not meaningful to talk about synchronizing clocks below that
- level with NTP either).
-
- The total timestamp is 128 bits, this includes a four bit version number
- field which would permit these decisions to be revisited in the future.
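-
- The figures quoted for the two timestamp formats can be checked with a
- little arithmetic. The sketch below assumes a mean Gregorian year for
- the rough ranges and counts NTP seconds from 1 January 1900; the
- variable names are illustrative only.
-
```python
SECONDS_PER_YEAR = 365.2425 * 86400   # mean Gregorian year, for rough ranges only
SPEED_OF_LIGHT_M_S = 299_792_458

# NTP: 32 bits of integral seconds plus 32 bits of fraction.
ntp_resolution_ps = 1e12 / 2**32
ntp_rollover_years = 2**32 / SECONDS_PER_YEAR
ntp_rollover_year = 1900 + ntp_rollover_years

# DTS: signed 64-bit count of 100 ns units, so 63 bits either side of the origin.
DTS_UNIT_S = 100e-9
dts_half_range_years = 2**63 * DTS_UNIT_S / SECONDS_PER_YEAR
dts_last_year = 1582 + dts_half_range_years   # origin is 15 October 1582

# Distance light travels during one DTS unit.
light_per_unit_m = SPEED_OF_LIGHT_M_S * DTS_UNIT_S

print(f"NTP resolution    : {ntp_resolution_ps:.1f} ps")
print(f"NTP rollover      : {ntp_rollover_years:.1f} years (about {ntp_rollover_year:.0f})")
print(f"DTS half range    : {dts_half_range_years:.0f} years (to about {dts_last_year:.0f} AD)")
print(f"Light in one unit : {light_per_unit_m:.1f} m")
```
-
- The results land near 233 ps and 136 years for NTP, and roughly 29,000
- years either side of 1582 and 30 meters per unit for DTS, in line with
- the rounded figures quoted in this exchange.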
-
- > I won't argue with your choice of timestamp format. My choice was
- > conditioned both by pragmatic issues of compatibility with other Internet
- > timekeeping protocols, as well as a perceived need to operate at the
- > highest accuracies and precisions capable of national laboratories. As
- > for synchronizing clocks with NTP below the 100-ns level, a project to
- > do exactly that is in progress here to compare LORAN-C and cesium time.
- > Note that not all NTP subnets operate using general-purpose computing
- > systems. My own zeal in pursuing the ultimate accuracy and precision
- > is largely conditioned by our ongoing work in gigabit network routing
- > and network synchronization.
-
- > In any case, the DTS timestamp format including inaccuracy and version
- > is a good idea. In principle, the inaccuracy is available in NTP in the
- > form of the synchronization distance and dispersion, but this is not
- > normally available at the Unix interface.
-
- > (portion deleted)
-
- With respect to applications involving precision time data, such as
- national standards laboratories, resolutions less than the 100
- nanoseconds provided by DTS are required. Present timekeeping systems
- for space science and navigation can maintain time to better than 30
- nanoseconds, while range data over interplanetary distances can be
- determined to less than a nanosecond. While an ordinary application
- running on an ordinary computer could not reasonably be expected to
- expect or render precise timestamps anywhere near the 200-picosecond
- limit of an NTP timestamp, there are many applications where a precision
- timestamp could be rendered by some other means and propagated via a
- computer and network to some other place for processing. One such
- application could well be synchronizing navigation systems like LORAN-C,
- where the timestamps would be obtained directly from station timekeeping
- equipment.
-
- There is an obvious inconsistency in your position here. If you're just
- using the NTP time format for synchronization, then talking about 136 year
- rollovers makes some sense. It could be hidden from the users by extending
- the protocol. If, however, as this paragraph implies you intend the NTP
- time format as a general timestamp, then there will be extreme pain in the
- year 2036. (This is referred to in DEC as the "date75" problem!) To avoid
- this without unduly extending the timestamp DTS has traded off being able
- to use its timestamp format for certain highly precise applications.
-
- > I have vivid memories of shout-out meetings in the early eighties
- > where we Interbums staked out positions on what you call the "date75"
- > problem. It seems that, no matter what resolution and rollover parameters
- > you select, somebody will complain the Big Bang or End of Time cannot
- > be represented to femtoseconds. For that matter, while my personal clock
- > may expire before 2036, even now I have great pain keeping track with
- > conventional date notation of investments that mature after the century
- > turns. In NTP I chose to explicitly and purposely leave out the 136-year
- > disambiguation function and relegate that to a higher protocol that
- > includes both this function and leap-second recording in network
- > institutional memory. Since the Earth is winding down in an unpredictable
- > way and papal bulls cannot endure forever and we haven't even got the
- > Julian days and Gregorian centuries consonant yet, I concluded that
- > life is too short and, like astronomers, we all should have used
- > (modified) Julian day-fraction reckoning in the first place.
-
- > (portion deleted)
-
- NTP specifically and intentionally has no provisions anywhere in the
- protocol to specify time zones or zone names. The service is designed to
- deliver UTC seconds and Julian days without respect to geographic
- position, political boundary or local custom. Conversion of NTP
- timestamp data to system format is expected to occur at the presentation
- layer; however, provisions are made to supply leap-second information to
- the presentation layer so that network time in the vicinity of leap
- seconds can be properly coordinated. DTS includes provision for time
- zones and presumably summer/winter adjustments in the form of a
- numerical time offset from UTC and arbitrary character-string label;
- however, it is not obvious how to distribute and activate this
- information in a coordinated manner.
-
- The information is used only as a help in user displays. That is, an
- application can display BOTH the UTC time and the local time at which
- a timestamp was created. It only cost 12 bits to do this. No use is
- made of the timezone information by DTS or by systems.
-
- > That clarifies the issue. Your intent is only to qualify the origin
- > of the timestamp. Point noted.
-
- NTP and DTS differ somewhat in the treatment of leap seconds. In DTS the
- normal growth in error bounds in the absence of corrections will
- eventually cause the bounds to include the new timescale and adjust
- gradually as in normal operation. Recognizing that this can take a long
- time, DTS includes special provisions that expand the error bounds at
- such times that leap seconds are expected to occur, which can shorten
- the period for convergence significantly. However, until the correction
- is determined and during the convergence interval the accuracy of the
- local clock with respect to other network clocks may be considerably
- degraded.
-
- The accuracy and stability expectations of NTP preclude this approach.
- In NTP the incidence of leap seconds is assumed available in advance at
- all primary servers and distributed automatically throughout the
- remainder of the synchronization subnet as part of normal protocol
- operations. Thus, every server and client in the subnet is aware at the
- instant the leap second is to take effect, and steps the local clock
- simultaneously with all other servers in the subnet. Thus, the local
- clock accuracy and stability are preserved before, during and after the
- leap insertion.
-
- Each server has to maintain and propagate this state before the leap
- insertion. This is, of course, subject to Byzantine failures. A failing
- server can insert a bad notification.
-
- > Did I miss something? By "propagate this state" do you mean DTS will
- > propagate advance notice of leap seconds? From what I can find rummaging
- > over the text, it appears that entities are expected to add one second
- > to their inaccuracy intervals at the end of June and December, which
- > would certainly shorten the convergence period if a leap did in fact
- > occur; however, there will be an unpredictable interval following that
- > when the clocks are all scurrying to catch up and network time can
- > be inconsistent up to a second. I worry about Byzantine failures, too.
- > That's why all those NTP timestamp consistency tests and, ultimately,
- > the NTP authentication scheme. It would appear that DTS is vulnerable
- > to replay in the same way NTP is vulnerable without this scheme.
-
- > (portion deleted)
-
- At first glance it may appear that NTP and DTS have quite different
- models to determine delay, offset and error budgets. Both involve the
- exchange of messages between two servers (or a client and a server).
- Both attempt to measure not only the clock offsets, but the roundtrip
- delay and, in addition, attempt to estimate the error. The diagrams
- below, in which time flows downward, illustrate a typical message
- exchange in each protocol between servers A and B.
-
-       A          B                 A          B
-
-       |          |                 |          |
-    t1 |--------->| t2           t1 |--------->|--- t4
-       |          |                 |          |  |
-       |          |                 |          |
-       |          |                 |          |  w
-       |          |                 |          |
-       |          |                 |          |  |
-    t4 |<---------| t3           t8 |<---------|---
-       |          |                 |          |
-
-           NTP                          DTS
-
- In NTP the roundtrip delay d and clock offset c of server B relative to
- A is
-
- d = (t4-t1) - (t3-t2)
- c = ((t2-t1) + (t3-t4))/2.
-
- This method amounts to a continuously sampled, returnable-time system,
- which is used in some digital telephone networks [LIN80].
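-
- The two NTP expressions above translate directly into code. In this
- sketch the function name and the sample timestamps are illustrative
- assumptions; t1 and t4 are read from A's clock, t2 and t3 from B's.
-
```python
def ntp_delay_offset(t1, t2, t3, t4):
    """Return (d, c): roundtrip delay and clock offset of B relative to A."""
    d = (t4 - t1) - (t3 - t2)         # time on the wire, excluding B's hold time
    c = ((t2 - t1) + (t3 - t4)) / 2   # average of the two one-way differences
    return d, c


# Hypothetical exchange: B runs 0.5 s ahead of A, each direction takes
# 0.1 s, and B holds the request for 0.02 s before replying.
d, c = ntp_delay_offset(t1=10.00, t2=10.60, t3=10.62, t4=10.22)
print(f"delay = {d:.3f} s, offset = {c:.3f} s")
```
-
- With these symmetric delays the sketch recovers the 0.2 s roundtrip
- delay and the 0.5 s offset.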
-
- The derivation of the expression for 'c' above assumes the two transit
- delays for this exchange are symmetric. If there are systematically
- asymmetric transmission delays then the NTP algorithm will shift the two
- clocks so that they appear to be synchronized, when in fact they are
- systematically off by some number of milliseconds. The NTP minimum
- filter attempts to minimize this effect assuming that the shortest round
- trip exchange would have to be symmetric or nearly so. Unfortunately quite
- large systematic asymmetric delays can occur for a variety of reasons:
- source-routed networks, broken routing tables, etc. and these would apply
- to all transactions including the shortest. This problem exists in DTS
- also, but in DTS both of the systems will have an inaccuracy which
- encompasses the correct time. That is, DTS will not claim to have
- synchronized clocks to a level which it has not, even in the presence of
- asymmetric delays. NTP can and has.
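-
- The effect of asymmetric paths on the offset expression can be shown
- with a small simulation. This is a sketch with made-up delays; the
- function names and parameter values are assumptions, not anything
- taken from either specification.
-
```python
def ntp_offset(t1, t2, t3, t4):
    """Offset of B relative to A, as in the expression for c above."""
    return ((t2 - t1) + (t3 - t4)) / 2


def simulate(true_offset, out_delay, back_delay, hold=0.02, t1=10.0):
    """Build one exchange and return (measured offset, roundtrip delay)."""
    t2 = t1 + out_delay + true_offset          # B's clock when the request arrives
    t3 = t2 + hold                             # B's clock when the reply leaves
    t4 = t1 + out_delay + hold + back_delay    # A's clock when the reply arrives
    return ntp_offset(t1, t2, t3, t4), (t4 - t1) - (t3 - t2)


TRUE_OFFSET = 0.5
measured, delay = simulate(TRUE_OFFSET, out_delay=0.15, back_delay=0.05)
error = measured - TRUE_OFFSET

print(f"measured offset = {measured:.3f} s, error = {error:+.3f} s")
print(f"roundtrip delay = {delay:.3f} s, half of it = {delay / 2:.3f} s")
```
-
- The error equals half the difference between the outbound and return
- delays, and its magnitude cannot exceed half the roundtrip delay. No
- single exchange reveals the error, which is the reviewer's point.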
-
- > Your observation on asymmetric paths leading to undetectable systematic
- > errors with both NTP and DTS is correct and is routinely observed to
- > varying degrees on the Internet. In fact, leaving out adjustments
- > necessary for frequency offset and precision (in both NTP and DTS) the
- > above formulas can be rewritten as presented in the DTS spec. We have
- > a project here designed to collect offset data from many or even all
- > subnet servers at non-intrusive rates in order to detect and correct
- > for asymmetric paths using correlation techniques.
-
- > I'm not sure what you mean by "NTP can and has" claimed to "have
- > synchronized clocks to a level which it has not, even in the presence
- > of asymmetric delays." NTP does not claim to synchronize to any level,
- > only to minimize the level of probabilistic uncertainty and estimate
- > the error incurred. In any case, what NTP calls the synchronizing
- > delay represents in fact the error bound relative to the synchronizing
- > path to the primary source.
-
- > These are probabilistic data and must be interpreted with respect to the
- > probability model which applies to real Internet paths. It may be that,
- > with an appropriate queueing model and assumed distribution functions,
- > a quantitative error probability function could be derived. Having
- > travelled those roads before myself, I conclude my pragmatic approach
- > to error estimation is probably as good as any. See [ALL74] for an
- > alternative approach.
-
- > (portion deleted)
-
- Both NTP and DTS have to do a little dance in order to account for
- timing errors due to the precisions of the local clocks and the
- frequency offsets (usually minor) over the transaction interval itself.
- A purist might argue that the expressions given above for delay and
- offset are not strictly accurate unless the probability density
- functions for the path delays are known, properly convolved and
- expectations computed, but this applies to both NTP and DTS. The point
- should be made, however, that correct functioning of DTS requires
- reliable bounds on measured roundtrip delay, as this enters into the
- error budget used to construct intervals over which a clock can be
- considered correct.
-
- However, this is not at all hard to compute. Simply increase the inaccuracy
- by the potential drift of the local clock during the transaction. The
- architecture specifies this.
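-
- A rough sketch of the bookkeeping described here, assuming the
- inaccuracy is a half-width in seconds. The maxDrift value, the
- function names and the exact form of the exchange are hypothetical
- and are not taken from the DTS specification.
-
```python
MAX_DRIFT = 1e-4  # hypothetical fractional drift bound; the real maxDrift is not given here


def widen_for_drift(inaccuracy, elapsed, max_drift=MAX_DRIFT):
    """Grow an inaccuracy by the worst-case local clock drift over `elapsed` seconds."""
    return inaccuracy + max_drift * elapsed


def interval_from_exchange(t_send, t_recv, server_time_at_reply,
                           server_inaccuracy, w, max_drift=MAX_DRIFT):
    """Return (estimate, inaccuracy) of the server's time at local instant t_recv.

    t_send and t_recv are local clock readings; w is the server's hold time.
    Assumes the reply carries the server clock reading taken as it was sent.
    """
    transit = (t_recv - t_send) - w               # total time on the wire, both ways
    estimate = server_time_at_reply + transit / 2  # midpoint of the possible return delays
    inaccuracy = server_inaccuracy + transit / 2   # return delay lies in [0, transit]
    return estimate, widen_for_drift(inaccuracy, t_recv - t_send, max_drift)


estimate, inaccuracy = interval_from_exchange(
    t_send=100.0, t_recv=100.3, server_time_at_reply=500.0,
    server_inaccuracy=0.01, w=0.1)
print(f"server time = {estimate:.4f} +/- {inaccuracy:.5f} s")
```
-
- Under these assumptions the derived inaccuracy always exceeds the
- server's own, which matches the reviewer's remark that a time sent
- around a loop comes back with a larger inaccuracy.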
-
- > Not hard to do in NTP either, as the architecture specifies. The
- > difference is that in NTP this is represented as a time-insensitive
- > bound, since the architecture expects the local-clock algorithm to
- > compensate for frequency errors. The system expectation is that the
- > (corrected) local clock does not wander more than an architectural
- > constant of 30 ms per day. Even in NTP it might be a good idea
- > to ratchet up the imputed skew when all sources are lost and the
- > bandwidth of the tracking loop is relatively large. This will be
- > considered in future.
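The drift allowance discussed above can be sketched in a few lines. This is only an illustration of the idea, not text from either specification; the function name and the 100-ppm drift rate used in the example are assumptions chosen for the arithmetic.

```python
def inflate_inaccuracy(inaccuracy_s, elapsed_s, max_drift_rate):
    """Widen an inaccuracy bound by the worst-case drift of the local
    clock over the elapsed interval.

    inaccuracy_s   -- inaccuracy in seconds at the start of the interval
    elapsed_s      -- length of the interval in seconds
    max_drift_rate -- assumed maximum fractional frequency error
                      (dimensionless, e.g. 1e-4 for 100 ppm)
    """
    if elapsed_s < 0 or max_drift_rate < 0:
        raise ValueError("elapsed time and drift rate must be non-negative")
    return inaccuracy_s + max_drift_rate * elapsed_s


# Hypothetical numbers: a 10-ms bound, a 2-s transaction, 100 ppm drift.
bound = inflate_inaccuracy(0.010, 2.0, 1e-4)
```

The same function, called with the time since the last update rather than the transaction time, would model the idea of ratcheting up the imputed skew when all sources are lost.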
-
- > (portion deleted)
-
- NTP maintains for each server both the total estimated roundtrip delay
- to the root of the synchronization subnet (synchronizing distance), as
- well as the sum of the total dispersion to the root of the
- synchronization subnet (synchronizing dispersion).
-
- This synchronizing distance has a rather loose definition. I believe the
- current NTP RFC suggests using ten times the mean expected error for
- the synchronizing distance. If this parameter is important to the NTP
- algorithm I would expect some stronger specification. Also, where does
- the value ten come from? I know it's experimentally derived and seems to
- work...
-
- > I must have confused you. Both the distances and dispersions are formally
- > defined in the spec. The factor of ten applies only in cases where the
- > delay and/or dispersion cannot be measured, such as with some timecode
- > receivers. Elsewhere throughout the subnet these quantities are
- > calculated. You will observe that the dispersion quantity is rather
- > artfully concocted (for efficiency reasons) and not directly convertible
- > to the usual second-moment statistics. Well, all I can say is that other
- > practitioners of these black arts mumble similar voodoo, but the
- > performance as an error estimator is still pretty good. Now, I should make
- > clear that a goal of NTP is to maintain overall accuracy relative to
- > the synchronization distance (roundtrip delay) to the root of the subnet
- > on the order of one-tenth that distance. That is an arbitrary goal, but
- > believed achievable on the basis of past experience.
-
- > There is an interesting feature which becomes evident reading the DTS
- > and NTP specification documents. The DTS and NTP procedures for reading
- > server times, computing bounds and selecting sources are roughly the
- > same complexity, although NTP fiddles with both delay and dispersion.
- > In addition, the procedures DTS uses to adjust the local clock, compute
- > the correction interval and determine the next update time have roughly
- > the same complexity as the NTP local-clock procedure.
-
- These quantities are
- included in the message exchanges and form the basis of the likelihood
- calculations. Since they always increase from the root, they can be used
- to calculate accuracy and reliability estimates, as well as to manage
- the subnet topology to reduce errors and resist destructive timing
- loops.
-
- While you state the synchronizing distance and synchronizing dispersion
- can be used to calculate accuracy, I have never seen a derivation of how
- this could be done. This is one of the recurring points, the lack of
- formal proofs.
-
- > Formal proofs are hard to come by, unless you make drastic assumptions
- > on the statistical models and distributions operative in the Internet.
- > Certainly, by the same sort of analysis presented in the DTS spec, the
- > notion of correct UTC time (for a truechimer) belonging to the interval
- > defined as the offset estimate +-1/2 the delay estimate is valid, as
- > long as the frequency estimate is within the stated tolerance. However,
- > DTS and NTP differ substantially in the philosophy of the selection
- > algorithm, as explained in the text. Since your comments did not speak
- > directly to this issue and I suspect you remain unconvinced, a full
- > examination should await another time and place.
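The interval "offset estimate +-1/2 the delay estimate" can be made concrete with the usual four-timestamp exchange. The sketch below uses the standard NTP offset and delay expressions; the variable names and the sample timestamps are mine, and the frequency error over the exchange is assumed negligible.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Clock offset and roundtrip delay from one request/reply exchange.

    t1 -- request sent (client clock)    t2 -- request received (server clock)
    t3 -- reply sent (server clock)      t4 -- reply received (client clock)
    """
    a = t2 - t1
    b = t3 - t4
    offset = (a + b) / 2.0   # estimated server time minus client time
    delay = a - b            # roundtrip delay, excluding server hold time
    return offset, delay


def correctness_interval(offset, delay):
    """Interval that must contain the true offset, whatever the split of
    the roundtrip delay between the two directions."""
    return offset - delay / 2.0, offset + delay / 2.0


# Hypothetical exchange: true offset 50 ms, 18 ms out and 2 ms back.
offset, delay = offset_and_delay(100.000, 100.068, 100.069, 100.021)
low, high = correctness_interval(offset, delay)
```

With these badly asymmetric delays the offset estimate is off by 8 ms, yet the true offset of 50 ms still lies inside the interval, which is the point made in the reply above.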
-
- > For an in-depth analysis of probabilistic models appropriate for
- > "well-behaved" timekeeping systems, see Appendix F of the latest spec
- > revision mentioned previously. You still might not like the results of
- > the analysis, since statistical models seldom give deterministic
- > results. I think you might not argue with a conclusion that accuracy
- > degrades with increasing synchronization distance and dispersion, but
- > might argue over the function that maps these numbers into acceptable
- > error bounds. For justification, see [MIL90a].
-
- In NTP the selection algorithm determines one or a number of
- synchronization candidates based on empirical rules and maximum-
- likelihood techniques. A combining algorithm determines the local-clock
- adjustment using a weighted-average procedure in which the weights are
- determined by offset sample dispersion.
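A weighted-average combining step might look like the following sketch. The text says only that the weights are determined by offset sample dispersion; taking each weight as the reciprocal of the dispersion is an assumption made here for illustration and should not be read as the rule in the NTP specification.

```python
def combine_offsets(candidates):
    """Weighted average of candidate offsets.

    candidates -- list of (offset_s, dispersion_s) pairs, dispersion > 0.
    Each weight is 1/dispersion (an assumed weighting), so candidates
    with smaller dispersion pull the result harder.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    weights = [1.0 / dispersion for _, dispersion in candidates]
    weighted_sum = sum(w * offset for w, (offset, _) in zip(weights, candidates))
    return weighted_sum / sum(weights)


# Hypothetical candidates: the second has twice the dispersion of the first.
adjustment = combine_offsets([(0.0, 1.0), (0.3, 2.0)])
```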
-
- > (portion deleted)
-
- The next step is designed to detect falsetickers or other conditions
- which might result in gross errors. The pruned and truncated candidate
- list is re-sorted in the order first by stratum and then by total
- synchronizing distance to the root; that is, in order of decreasing
- likelihood. A similar procedure is also used in Marzullo's MM algorithm
- [MAR85]. Next, each entry is inspected in turn and a weighted error
- estimate computed relative to the remaining entries on the list. The
- entry with maximum estimated error is discarded and the process repeats.
- The procedure terminates when the estimated error of each entry
- remaining on the list is less than a quantity depending on the intrinsic
- precisions of the local clocks involved.
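The discard loop described above can be sketched as follows. The error measure used here, the mean absolute offset difference from the remaining entries, is a simplified stand-in for the weighted estimate in the specification, and the threshold is a hypothetical value standing in for the precision-dependent quantity.

```python
def prune_candidates(offsets, threshold_s):
    """Repeatedly discard the entry with the largest estimated error.

    offsets     -- candidate clock offsets in seconds
    threshold_s -- stop once every remaining entry's error is below this
    The error of an entry is its mean absolute difference from the other
    entries (an assumed, unweighted simplification).
    """
    survivors = list(offsets)
    while len(survivors) > 1:
        errors = [
            sum(abs(x - y) for j, y in enumerate(survivors) if j != i)
            / (len(survivors) - 1)
            for i, x in enumerate(survivors)
        ]
        worst = max(range(len(survivors)), key=errors.__getitem__)
        if errors[worst] < threshold_s:
            break                      # every remaining entry is acceptable
        del survivors[worst]           # discard the outlier and repeat
    return survivors


# Hypothetical list: three clocks agree within 2 ms, one is 240 ms away.
kept = prune_candidates([0.010, 0.011, 0.012, 0.250], threshold_s=0.005)
```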
-
- A point which is not discussed here is that when NTP chooses to prune
- an entry, it can not determine if this entry's problem is that it
- comes from a bad clock (falseticker in your jargon), or experienced
- unusually large and asymmetric network delays. The latter case is
- something to be expected in normal operation, the former represents a
- problem which should be fixed. DTS uses the interval information to
- identify such bad clocks, and reports them, since if a clock's interval
- doesn't intersect the majority it is clearly faulty. This is, of course,
- a MAJOR issue in distributed system management.
-
- > NTP can determine whether a peer or radio has not responded for a "long"
- > time or whether the problem is excessive dispersion. NTP implementations
- > do keep track of both and report when a peer or radio becomes selected
- > or deselected, reachable or unreachable and so forth. After watching peers
- > and radios of various manufacture continuously for several years and
- > experiencing what could be considered most bizarre behavior on occasion,
- > I have concluded there is no way to reliably distinguish a falseticker
- > from simple excessive delay or propagation variance on other than a
- > probabilistic basis. I claim this even after admitting the fuzzball
- > timecode receiver drivers have an incredible array of consistency
- > checking and monitoring machinery which can and often does detect a
- > misbehaving peer or radio. I also conclude that radio design can
- > be vastly improved by providing detailed signal-quality information in
- > the timecode itself. At one time the fuzzballs carefully and
- > exasperatingly logged and reported every little thing, like when a
- > peer or radio became unreachable or experienced excessive dispersion,
- > etc., but now these events are logged at the server and available only
- > if the remote monitoring program requests them.
-
- The fundamental assumption upon which the DTS is founded is Marzullo's
- proof that a set of M clocks synchronized by the above algorithm, where
- no more than j clocks are faulty, delivers an interval including UTC.
- The algorithm is simple, both to express and to implement, and involves
- only one sorting step instead of two as in NTP. However, consider the
- following scenario with M = 3, j = 1 and containing three intervals A, B
- and C:
-
- A +--------------------------+
- B +----+
- C +----+
-
- Result +-----================-----+
-
- Using the algorithm described in the DTS functional specification, both
- the lower and upper endpoints of interval A are in M-j = 2 intervals,
- thus the resulting interval is coincident with A. However, there remains
- the interval marked "=" which contains points not contained in at least
- two other intervals. The DTS document mentions this interesting fact,
- but makes a quite reasonable choice to avoid multiple intervals in favor
- of a single one, even if that does in principle violate the correctness
- assumptions.
-
- Come on, this in no way violates the correctness assumption. The
- proofs tell us that the correct time is somewhere in the two dashed
- sub-intervals. By making the statement that the time is somewhere in the
- larger interval, a server is making a WEAKER assertion. Marzullo's proof
- would apply and the algorithm would work (sub-optimally) if servers
- arbitrarily lengthened the intervals they computed.
-
- > Zounds, you have cut me to the quick. My conclusion was based on my
- > reading of the text in Section 3.3 of the DTS spec and the stated
- > algorithm, which seemed at first reading to me at variance with
- > Marzullo's principles presented in the CACM paper. In your algorithm
- > you arrange the endpoints in a list in order of indicated times, with
- > lower bounds preceding upper bounds of the same value. For M-j = 2
- > and the above figure, the algorithm will start at the lower limits of
- > A and B and work upward, then start at the upper limits of A and C and
- > work down. The first step will conclude the lower limit as the lower
- > limit of intervals A and B and the upper limit as the upper limit of
- > intervals A and C. Your correctness assumption uses "the smallest
- > single interval containing all points in at least M-f of the intervals,"
- > which is exactly what your algorithm computes. I can restate that by
- > saying you require at least one clock interval to include UTC, not that
- > each of the M-j = 2 clocks agrees to the same interval. As I recall,
- > Marzullo's paper did not consider this case, but it is a natural
- > extension. I conclude my claim is unfounded and it will disappear in the
- > rewrite.
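The endpoint scan described in the reply above can be sketched as follows. This is my reading of the procedure as summarized in this discussion, not a transcription of the DTS specification; the function name and the numeric endpoints chosen for A, B and C are assumptions made to reproduce the figure.

```python
def smallest_covering_interval(intervals, max_faulty):
    """Smallest single interval containing every point that lies in at
    least M - f of the M given (low, high) intervals, or None if no point
    is covered that many times."""
    need = len(intervals) - max_faulty
    # Tag 0 = lower bound, 1 = upper bound, so that at equal times the
    # lower bounds sort ahead of the upper bounds.
    points = sorted([(low, 0) for low, _ in intervals] +
                    [(high, 1) for _, high in intervals])

    def scan(sequence, opening_tag):
        depth = 0
        for time, tag in sequence:
            if tag == opening_tag:
                depth += 1
                if depth >= need:
                    return time
            else:
                depth -= 1
        return None

    lower = scan(points, 0)             # work upward from the lower limits
    upper = scan(reversed(points), 1)   # work downward from the upper limits
    if lower is None or upper is None:
        return None
    return lower, upper


# The figure above with hypothetical endpoints: B sits at the left end
# of A, C at the right end, and M - j = 2.
A, B, C = (0, 26), (0, 5), (21, 26)
result = smallest_covering_interval([A, B, C], max_faulty=1)
```

For these endpoints the result coincides with A, as in the figure. The middle stretch lies in only one interval, so the single interval makes a weaker assertion than the two sub-intervals would, but not an incorrect one.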
-
- > (portion deleted)
-
- In point of fact, the local clock model described in the NTP
- specification is listed as optional in the same spirit as the model
- described in the DTS functional description. As such, the local clock
- can in principle be considered implementation specific and not part of
- the formal specification.
-
- This is a rather odd statement. What I read is that the local clock
- model is not explicitly required by the NTP documents, but it is, in fact,
- required in functioning implementations.
-
- > The intent in the original NTP spec was to define the protocol itself,
- > saving the filtering, selection, combining and local-clock algorithms
- > for later specification exercises. As a pragmatic matter, nobody would
- > implement NTP unless there was some guidance for these algorithms. As
- > the architecture and protocol were refined, it became clear that a
- > well performing system of clocks could not be achieved unless
- > certain aspects of these algorithms were standardized, namely the
- > parameters of the local-clock algorithm, which is at the heart of
- > the stability issue. You correctly observe that the NTP spec is
- > confusing in this area.
-
- However, as demonstrated above, frequency
- compensation requires the local clock adjustment to be carefully
- specified and implemented. The NTP mechanism has been carefully
- analyzed, simulated, implemented and deployed in the Internet, but DTS
- has not.
-
- I have never read a clear specification of the required quality of the
- input time to NTP. However, the following argument shows that in a LAN
- of typical machines, DTS can indeed provide time to NTP. The clock
- resolution of most machines is between 1 and 16.7 milliseconds. Thus,
- any single measurements made by NTP MUST experience this clock jitter.
- NTP can achieve better overall results only by averaging many such
- measurements. We have measured the 'jitter' of DTS times in LANs; it is
- less than 10 milliseconds, so if DTS supplies time to NTP in a typical
- LAN, the NTP will receive time similar in quality to the time it gets
- from other NTP servers. In the WAN case, the jitter may be a problem;
- I assume that to interoperate in the presence of WAN links may require
- clock training.
-
- If you could provide the derivation of accuracy from synchronization
- distance and synchronization dispersion that you allude to in section 4.2,
- this could form the basis of reliable interoperation with NTP supplying
- time to DTS. Alas, I suspect such a derivation is unachievable. However,
- for installations which are not concerned with the DTS guarantee, the
- time provider interface could be used to import NTP time into DTS (just
- like any time provider, though there would have to be a user supplied
- inaccuracy, based on local experience with NTP). We intend to include a
- sample time provider program to permit this.
-
- > As I said previously, and subject to the assumptions made there, the
- > NTP synchronizing distance is computed similar to the DTS inaccuracy
- > interval. However, a derivation of estimated error interval from measured
- > distance and dispersion is not achievable on other than a statistical
- > basis, which wouldn't do you much good. However, there is a basic
- > flaw to your argument in achieving interoperability with NTP. The
- > NTP architecture involves a probabilistic system of mutually coupled
- > oscillators controlled by what is called in traditional control theory
- > a type-II phase-lock loop (PLL). A type-II loop is necessary to estimate
- > frequency, as well as phase. If you accept the requirement that the
- > subnet of distributed oscillators must operate plesiochronously
- > (phase-locked to possibly many reference oscillators themselves slaved,
- > but not phase-locked to UTC), then you are stuck with type-II loops.
-
- > The fundamental problem with type-II loops is that they can become
- > unstable and sail off into the wild blue yonder if the loop time
- > constants are not maintained within specified tolerances. There is
- > much machinery in the NTP local-clock model that addresses these
- > issues in order to maintain stability throughout the subnet. It has
- > been the experience that stability can be reliably maintained over a wide
- > range of network delays, outages, etc.; however, the cost is a tighter
- > specification on the dynamic characteristics of the local-clock
- > algorithms. See Appendix G of the cited NTP spec revision for a
- > mathematical analysis of the NTP PLL. Note that RFC-1119 contains
- > minor errors in some of the implementation formulas.
-
- > It would in fact be possible to "take time" from a DTS server and splice
- > it into NTP, in spite of the probably large phase noise; however, it
- > would probably not be possible to integrate a DTS subnet into an NTP
- > system where DTS was used for time transfer between one NTP subnet and
- > another.
-
- > (portion deleted)
-
- It is an uncontested fact that computer systems can be badly disrupted
- should apparent time appear to warp (jump) backwards, rather than always
- tick forward as our universe requires. Both NTP and DTS take explicit
- precautions to avoid the local clock running backwards or large warps
- when running forwards. However, both NTP and DTS models recognize that
- there are some extreme conditions in which it is better to warp
- backwards or forwards, rather than allow the adjustment procedure to
- continue for an outrageously long time. The local clock is warped if the
- correction exceeds an implementation constant, +-128 milliseconds for
- NTP and ten minutes for DTS. The large difference between the NTP and
- DTS values is attributed to the accuracy models assumed.
-
- I believe the difference also comes from different assumptions of the
- risks (and probabilistic costs) involved in jumping the clock. We assume
- it is something you want to do rarely.
-
- > The NTP experience is that, with a +-128-ms window and the Internet
- > peers I watch, I have not observed a jump any time over the last couple
- > of years, except upon reboot or upon insertion of the latest leap
- > second, when a couple of silly implementation bugs were found. Some
- > users have found it necessary to upsize the window on combined
- > satellite/landline paths and on paths frequently experiencing severe
- > network congestion. In fact, we have used up to +-512 ms on some paths
- > to Europe and would be glad to use larger ones should that become
- > necessary. I think this is a non-issue with respect to comparing
- > the NTP and DTS models.
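The warp decision under discussion reduces to a threshold test. The sketch below uses the two constants quoted above (+-128 ms for NTP, ten minutes for DTS); the function and constant names are hypothetical, and a real implementation would also have to carry out the gradual adjustment itself.

```python
NTP_WARP_THRESHOLD_S = 0.128    # +-128 ms, as quoted above
DTS_WARP_THRESHOLD_S = 600.0    # ten minutes, as quoted above


def choose_adjustment(correction_s, threshold_s):
    """Return "step" if the correction exceeds the threshold in either
    direction (the clock is warped), otherwise "slew" (the clock is
    adjusted gradually and never runs backwards)."""
    return "step" if abs(correction_s) > threshold_s else "slew"


# A hypothetical 200-ms correction is a warp for NTP but not for DTS.
ntp_action = choose_adjustment(0.200, NTP_WARP_THRESHOLD_S)
dts_action = choose_adjustment(0.200, DTS_WARP_THRESHOLD_S)
```

Enlarging the window on a troublesome path, as mentioned above for the paths to Europe, amounts to passing a larger threshold such as 0.512.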
-
- For most servers and transmission paths in the Internet an offset spike
- (following filtering, selection and combining operations) over +-128
- milliseconds following filtering, selection and combining operations is
- so rare as to be almost negligible.
-
- The duplicated text makes me think there is something wrong here, though
- frankly I don't understand what this paragraph is trying to say.
-
- > Probably awkwardly stated, what I'm trying to say is that the combining
- > and local-clock algorithms have the effect of reducing apparent errors
- > following the clock filter by a substantial amount over the "few
- > tens of milliseconds" assumed by conventional wisdom. See [MIL90a].
-
- > (portion deleted)
-
- The service objectives of both NTP and DTS are substantially the same:
- to deliver correct, accurate, stable and reliable time throughout the
- synchronization subnet. However, as demonstrated in this document, these
- objectives are not all simultaneously achievable. For instance, in a
- system of real clocks some may be correct according to an established
- and trusted criterion (truechimers) and some may not (falsetickers).
- In the models used by NTP and DTS the distinction between these two groups
- is made on the basis of different clustering techniques, neither of
- which is statistically infallible. A succinct way of putting it might be
- to say that NTP attempts to deliver the most accurate, stable and
- reliable time according to statistical principles, while DTS attempts to
- deliver validated time according to correctness principles, but possibly
- at the expense of accuracy and stability.
-
- I would claim you're understating DTS's goals of autoconfigurability
- and manageability.
-
- > I would be glad to elevate the consciousness of this issue in the
- > rewrite.
-
- In both the NTP and DTS models the problem is to determine which subset
- of possibly many clocks represents the truechimers and which do not. An
- interesting observation about both NTP and DTS is that neither attempts
- to assess the relative importance of misses (mislabelling a truechimer
- as a falseticker) relative to false alarms (mislabelling a falseticker
- as a truechimer). In signal detection analysis this is established by
- the likelihood ratio, with high ratios favoring misses over false
- alarms. In retrospect, it could be said that NTP assumes a somewhat
- lower likelihood ratio than does DTS.
-
- I'm not sure I understand your jargon here. The important trade off
- for DTS is to notify managers of broken clocks (calling a falseticker
- a falseticker) so that it can be fixed. Declaring a good clock bad
- (labeling a truechimer a falseticker) could only occur in DTS as an
- implementation error or as a massive multi-server failure. In either
- case a human will have to get involved.
-
- > Likelihood ratio is a tool of mathematics and estimation theory and
- > is frequently used in statistical signal transmission and detection.
- > The likelihood of an event can be computed from the probability
- > model and assumptions about the underlying events of that model.
- > For example, there are four possible outcomes of a probabilistic
- > hypothesis that purports to reveal the results of an experiment:
- > (1) you said it hit and it really hit, (2) you said it missed and it
- > really missed, (3) you said it hit, but it really missed and (4) you said
- > it missed, but it really hit. Now, a complete probabilistic analysis
- > would require you place weights on each of these possible outcomes,
- > from which you can determine the overall success of your hypothesis
- > construction technique. This is where the likelihood ratio comes in.
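How the weights on the four outcomes lead to a likelihood-ratio test can be sketched with the textbook minimum-cost (Bayes) decision rule. Neither NTP nor DTS computes anything like this explicitly; the names, the prior probability and the costs below are all hypothetical. The labels follow the usage in the text: a miss calls a truechimer a falseticker, and a false alarm calls a falseticker a truechimer.

```python
def decision_threshold(p_truechimer, cost_false_alarm, cost_miss):
    """Likelihood-ratio threshold that minimizes expected cost, assuming
    correct decisions cost nothing.

    p_truechimer     -- prior probability that a clock is a truechimer
    cost_false_alarm -- cost of accepting a falseticker as a truechimer
    cost_miss        -- cost of rejecting a truechimer as a falseticker
    """
    p_falseticker = 1.0 - p_truechimer
    return (p_falseticker * cost_false_alarm) / (p_truechimer * cost_miss)


def accept_as_truechimer(likelihood_ratio, threshold):
    """likelihood_ratio = P(data | truechimer) / P(data | falseticker)."""
    return likelihood_ratio > threshold


# Hypothetical weights: a false alarm is judged nine times as costly as
# a miss, and nine clocks in ten are believed to be truechimers.
threshold = decision_threshold(0.9, cost_false_alarm=9.0, cost_miss=1.0)
```

A higher threshold demands more evidence before a clock is accepted, so it trades false alarms for misses, which is the sense of "high ratios favoring misses over false alarms" above.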
-
- It might be concluded from the discourse in this document that, if the
- service objective is the highest accuracy and precision, then the
- protocol of choice is NTP; however, if the objective is correctness,
- then the protocol of choice is DTS. However, the discussion in Section
- 4.2 casts some doubt either on this claim, the DTS functional
- specification or this investigator's interpretation of it.
-
- I believe you are doing your position a disservice by raising this
- red-herring. No one has found your argument that DTS violates the
- assumptions of Marzullo's thesis convincing. Lamport commented that
- it indicates a serious misunderstanding of Marzullo's proof.
-
- > The last sentence should be struck and tell Leslie I said "hi."
-
- It is
- certainly true that DTS is "simple" and NTP is "complex," but these are
- relative terms and the complexity of NTP did not result from accident.
- That even the complexity of NTP is surmountable is demonstrated by the
- fact that over 2000 NTP-synchronized servers chime each other in the
- Internet now.
-
- The ever decreasing cost of time providers argues heavily for a simple
- solution, even though it may require more time providers. It simply isn't
- worth a lot of software complexity, (and maintenance cost, and management
- cost) to avoid spending a few dollars to buy more providers. Further,
- the philosophy of 'correctness' leads to certifiable implementation by
- independent vendors.
-
- > I continue to believe it is not constructive to "certify correctness" in
- > probabilistic systems, only to exchange acceptable tolerance bounds for
- > acceptable error bounds. If by "time providers" you imply each is
- > associated with a radio clock, I do not think it likely that the
- > cost of a radio clock will plummet to the point that every LAN can
- > afford one and, even if it did, you can not trust a single radio. You
- > have to have more than one of them and, preferably, no common point
- > of failure between them.
-
- > (portion deleted)
-
- The widespread deployment of NTP in the Internet seems to confirm that
- distributed Internet applications can expect that reliable, synchronized
- time can be maintained to within about two orders of magnitude less than
- the overall roundtrip delay to the root of the synchronization subnet.
- For most places in the Internet today that means overall network time
- can be confidently maintained to a few tens of milliseconds [MIL90a].
- While the behavior of large-scale deployment of DTS in internet
- environments is unknown, it is unlikely that it can provide comparable
- performance in its present form. With respect to the future refinement
- of DTS, should this be considered, it is inevitable that the same
- performance obstacles and implementation choices found by NTP will be
- found by DTS as well.
-
- I disagree with this final paragraph. I think that NTP and DTS both attain
- their very different goals. Our difference of opinion is in how important
- the different goals are. I accept that DTS will not keep clocks quite as
- tightly synchronized as NTP. It will, however, be a product that a vendor
- can confidently ship to customers who are expected to install, configure
- and manage it themselves.
-
- > We sure do have vastly different goals. Mine is a scientific one. I am
- > keenly interested in the technology of synchronizing time and frequency
- > to the highest degree of performance possible in the present state of
- > the art. I have found it useful in my own research to promote and
- > sustain an agenda to systematically refine NTP as an architecture,
- > protocol and set of implementations and promote its establishment as
- > an Internet Standard protocol. I also find it useful to promote, help
- > run and mount experiments with a largish number of Internet hosts which
- > now find NTP useful. I do not have a commercial agenda, nor do I have a
- > particular interest in the standards process other than to hope whatever
- > lessons learned in almost a decade of Internet timekeeping are documented
- > and made available to the R&D community. You may have seen my message to
- > the OSF in which I said the same thing and my hope that you guys, who
- > well might own the standard of choice, thoughtfully consider the points
- > I raise and think about how those features you think valuable in the
- > long run might be anticipated now and perhaps added at some future time.
-
- (remainder deleted)
-
- Dave
-